Author: Roberto Muñoz
E-mail: rmunoz@uc.cl
This notebook shows how to create Series and DataFrames with Pandas, and how to read CSV files and create pivot tables. The first part is based on Chapter 3 of the Python Data Science Handbook.
In [33]:
# __future__ imports must be the first statement in the cell
from __future__ import print_function
import numpy as np
In [34]:
import pandas as pd
pd.__version__
Out[34]:
In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Out[3]:
As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:
In [4]:
data.values
Out[4]:
The index is an array-like object of type pd.Index, which we'll discuss in more detail momentarily.
In [5]:
data.index
Out[5]:
Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
In [6]:
data[1]
Out[6]:
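As a minimal sketch of the access pattern above, the cell below looks up a single element and takes a slice from a Series with the default integer index. The variable names `s`, `second` and `middle` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative Series with the default integer index (0, 1, 2, 3)
s = pd.Series([0.25, 0.5, 0.75, 1.0])

second = s[1]     # single element, looked up through the index label 1
middle = s[1:3]   # integer slice: positions 1 and 2 (the stop is exclusive)

print(second)
print(middle)
```

Note that the slice keeps the original index labels (1 and 2) rather than renumbering from zero.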
From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values. Because the index is explicit, it need not be made of integers; for example, we can use strings as labels:
In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
Out[7]:
And the item access works as expected:
In [8]:
data['b']
Out[8]:
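The labels do not have to be strings or consecutive numbers either. The sketch below, with made-up integer labels, shows that label-based and position-based lookups can then return different elements. It uses the explicit `.loc` (label) and `.iloc` (position) accessors to avoid ambiguity.

```python
import pandas as pd

# Illustrative Series with non-consecutive integer labels
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = s.loc[2]      # the element whose label is 2 (the first one)
by_position = s.iloc[2]  # the element at position 2 (the third one)

print(by_label, by_position)
```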
In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
In [9]:
population_dict = {'Arica y Parinacota': 243149,
                   'Antofagasta': 631875,
                   'Metropolitana de Santiago': 7399042,
                   'Valparaiso': 1842880,
                   'Bíobío': 2127902,
                   'Magallanes y Antártica Chilena': 165547}
population = pd.Series(population_dict)
population
Out[9]:
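Note that the order of the index depends on the versions in use. Older releases of Pandas sorted the keys lexicographically when building a Series from a dictionary, whereas Pandas 0.23 and later, running on Python 3.6 or newer, keep the dictionary's insertion order.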
In [10]:
population['Arica y Parinacota']
Out[10]:
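To make the dictionary analogy a little more concrete, here is a small sketch using two of the regions above. It assumes nothing beyond standard Series behaviour: membership tests and `keys()` work on the index labels, while the values share a single dtype, so aggregations such as `sum()` run over the whole array at once.

```python
import pandas as pd

# Two entries taken from the population dictionary above
pop = pd.Series({'Antofagasta': 631875, 'Valparaiso': 1842880})

has_region = 'Antofagasta' in pop   # membership is tested against the index labels
labels = list(pop.keys())           # keys() returns the index
total = pop.sum()                   # aggregation over the typed values

print(has_region, labels, pop.dtype, total)
```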
Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
In [11]:
# Slice bounds must match the index labels exactly
population['Metropolitana de Santiago':'Valparaiso']
Out[11]:
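One detail worth keeping in mind when slicing: a slice over labels includes its end point, while a slice over positions does not. The following sketch illustrates the difference on a small Series; the names are illustrative.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s['b':'c']       # label slice: 'c' is included
by_position = s.iloc[1:2]   # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```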
The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.
In [12]:
# Area in km^2
area_dict = {'Arica y Parinacota': 16873.3,
             'Antofagasta': 126049.1,
             'Metropolitana de Santiago': 15403.2,
             'Valparaiso': 16396.1,
             'Bíobío': 37068.7,
             'Magallanes y Antártica Chilena': 1382291.1}
area = pd.Series(area_dict)
area
Out[12]:
Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:
In [13]:
regions = pd.DataFrame({'population': population,
                        'area': area})
regions
Out[13]:
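When a DataFrame is built from a dictionary of Series, the rows are aligned on the index labels rather than on position. The sketch below uses deliberately mismatched inputs (the `area` Series lists the regions in a different order and has one extra entry) to show that values are matched by label and that missing entries become NaN.

```python
import pandas as pd

population = pd.Series({'Antofagasta': 631875, 'Valparaiso': 1842880})
# Different order, plus one region that has no population entry
area = pd.Series({'Valparaiso': 16396.1,
                  'Antofagasta': 126049.1,
                  'Bíobío': 37068.7})

df = pd.DataFrame({'population': population, 'area': area})
print(df)
```

The resulting index is the union of the two input indexes, so the population of the extra region is reported as missing.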
In [14]:
regions.index
Out[14]:
In [15]:
regions.columns
Out[15]:
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier:
In [16]:
regions['area']
Out[16]:
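The dictionary-style interface also works for assignment. As a sketch, the cell below rebuilds a two-row version of the table and adds a derived column; the column name `density` is an illustrative choice and does not appear in the original notebook.

```python
import pandas as pd

# Two-row version of the regions table (outer keys: columns, inner keys: index)
regions = pd.DataFrame({
    'population': {'Antofagasta': 631875, 'Valparaiso': 1842880},
    'area': {'Antofagasta': 126049.1, 'Valparaiso': 16396.1},
})

area_col = regions['area']   # a column is returned as a Series

# Assigning to a new key adds a column, computed element-wise
regions['density'] = regions['population'] / regions['area']
print(regions)
```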
In [17]:
pd.DataFrame(population, columns=['population'])
Out[17]:
In [18]:
pd.DataFrame({'population': population,
              'area': area}, columns=['population', 'area'])
Out[18]:
In [19]:
regiones_file = 'data/chile_regiones.csv'
provincias_file = 'data/chile_provincias.csv'
comunas_file = 'data/chile_comunas.csv'

regiones = pd.read_csv(regiones_file, header=0, sep=',')
provincias = pd.read_csv(provincias_file, header=0, sep=',')
comunas = pd.read_csv(comunas_file, header=0, sep=',')
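If the CSV files are not at hand, the same call can be tried on an in-memory buffer. The sketch below is self-contained: the CSV text, including its column names and values, is made up for illustration and is not the content of the files above.

```python
import io
import pandas as pd

# Hypothetical CSV content, used only to demonstrate read_csv
csv_text = """RegionID,Region
15,Arica y Parinacota
2,Antofagasta
"""

# header=0 takes the first line as column names; sep=',' sets the delimiter
df = pd.read_csv(io.StringIO(csv_text), header=0, sep=',')

# For a file like this, both arguments match what read_csv does by default
df_default = pd.read_csv(io.StringIO(csv_text))

print(df)
```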
In [20]:
print('regiones table: ', regiones.columns.values.tolist())
print('provincias table: ', provincias.columns.values.tolist())
print('comunas table: ', comunas.columns.values.tolist())
In [21]:
regiones.head()
Out[21]:
In [22]:
provincias.head()
Out[22]:
In [23]:
comunas.head()
Out[23]:
In [24]:
regiones_provincias = pd.merge(regiones, provincias, how='outer')
regiones_provincias.head()
Out[24]:
In [25]:
provincias_comunas = pd.merge(provincias, comunas, how='outer')
provincias_comunas.head()
Out[25]:
In [26]:
regiones_provincias_comunas = pd.merge(regiones_provincias, comunas, how='outer')
regiones_provincias_comunas.index.name = 'ID'
regiones_provincias_comunas.head()
Out[26]:
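The merges above do not name a key, so Pandas joins on the columns the two tables have in common. The sketch below shows this on two tiny tables; the column names and rows are illustrative assumptions, not the schema of the actual CSV files. With `how='outer'`, rows without a match on the other side are kept and filled with NaN.

```python
import pandas as pd

# Hypothetical tables that share the 'RegionID' column
regiones = pd.DataFrame({'RegionID': [1, 2],
                         'Region': ['Tarapacá', 'Antofagasta']})
provincias = pd.DataFrame({'RegionID': [1, 1, 2, 3],
                           'Provincia': ['Iquique', 'Tamarugal',
                                         'Tocopilla', 'Copiapó']})

# Explicit key: clearer, and safe if the tables later share other columns
merged = pd.merge(regiones, provincias, how='outer', on='RegionID')

# Implicit key: Pandas infers the shared column(s)
implicit = pd.merge(regiones, provincias, how='outer')

print(merged)
```

Naming the key with `on=` is a defensive choice: if two tables happen to share an unrelated column name, the implicit form would silently join on it as well.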
In [27]:
#regiones_provincias_comunas.to_csv('chile_regiones_provincia_comuna.csv', index=False)
In [35]:
data_file = 'data/chile_demographic.csv'
data = pd.read_csv(data_file, header=0, sep=',')
data
Out[35]:
In [36]:
data.sort_values('Poblacion')
Out[36]:
In [37]:
data.sort_values('Poblacion', ascending=False)
Out[37]:
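As a small sketch of what `sort_values` does, the cell below sorts a made-up three-row table in both directions. The region names and figures are placeholders. Note that `sort_values` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Placeholder data, for illustration only
df = pd.DataFrame({'Region': ['A', 'B', 'C'],
                   'Poblacion': [300, 100, 200]})

ascending = df.sort_values('Poblacion')                     # smallest first
descending = df.sort_values('Poblacion', ascending=False)   # largest first

print(ascending)
print(descending)
```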
In [40]:
# Select several columns with a list; tuple-style selection fails in recent Pandas
data.groupby('Region')[['Poblacion', 'Superficie']].sum()
Out[40]:
In [48]:
# Total population and area per region, largest population first
(data.groupby('Region')[['Poblacion', 'Superficie']]
     .sum()
     .sort_values('Poblacion', ascending=False))
Out[48]:
In [49]:
# Group by both the region ID and its name, so the result is ordered by ID
data.sort_values(['RegionID']).groupby(['RegionID', 'Region'])[['Poblacion', 'Superficie']].sum()
Out[49]:
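The same per-region totals can also be written as a pivot table. The sketch below uses a small made-up table that reuses the column names from the cells above; the values and region names are placeholders, not the contents of the demographic file.

```python
import pandas as pd

# Placeholder rows that reuse the column names from the cells above
data = pd.DataFrame({
    'RegionID': [1, 1, 2, 2],
    'Region': ['Norte', 'Norte', 'Sur', 'Sur'],
    'Poblacion': [100, 200, 300, 400],
    'Superficie': [10.0, 20.0, 30.0, 40.0],
})

# groupby: split by region, then sum the selected columns
totals = data.groupby('Region')[['Poblacion', 'Superficie']].sum()

# pivot_table: the same aggregation, expressed declaratively
pivot = data.pivot_table(index='Region',
                         values=['Poblacion', 'Superficie'],
                         aggfunc='sum')

print(totals)
print(pivot)
```

For a single grouping key the two forms give the same numbers; `pivot_table` becomes more convenient once a second key is spread across the columns with its `columns=` argument.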